72 research outputs found
Algorithm Engineering for High-Dimensional Similarity Search Problems (Invited Talk)
Similarity search problems in high-dimensional data arise in many areas of computer science such as databases, image analysis, machine learning, and natural language processing. One of the most prominent problems is finding the k nearest neighbors of a data point q ∈ ℝ^d in a large set of data points S ⊂ ℝ^d, under some distance measure such as Euclidean distance. In contrast to lower dimensional settings, we do not know of worst-case efficient data structures for such search problems in high-dimensional data, i.e., data structures that are faster than a linear scan through the data set. However, there is a rich body of (often heuristic) approaches that solve nearest neighbor search problems much faster than such a scan on many real-world data sets. As a necessity, the term solve means that these approaches give approximate results that are close to the true k-nearest neighbors. In this talk, we survey recent approaches to nearest neighbor search and related problems.
The talk consists of three parts: (1) What makes nearest neighbor search difficult? (2) How do current state-of-the-art algorithms work? (3) What are recent advances regarding similarity search on GPUs, in distributed settings, or in external memory?
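As a point of reference for the linear scan mentioned in the abstract, the following is a minimal sketch of exact k-nearest-neighbor search under Euclidean distance. The function name `knn_linear_scan` and the sample data are illustrative assumptions, not taken from the talk.

```python
import heapq
import math


def knn_linear_scan(points, query, k):
    """Return the k points closest to `query` under Euclidean distance.

    Every point is examined once, so the cost grows linearly with the
    size of the data set (the baseline the abstract refers to).
    """
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


# Hypothetical toy data set S and query q in two dimensions.
S = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (0.5, 0.2)]
q = (0.4, 0.1)
neighbors = knn_linear_scan(S, q, k=2)
```

Approximate methods aim to return a result close to `neighbors` without examining every point.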
How Good Is Multi-Pivot Quicksort?
Multi-Pivot Quicksort refers to variants of classical quicksort where in the
partitioning step k pivots are used to split the input into k + 1 segments.
For many years, multi-pivot quicksort was regarded as impractical, but in 2009
a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the
standard sorting algorithm in Sun's Java 7. In 2014 at ALENEX, Kushagra et al.
introduced an even faster algorithm that uses three pivots. This paper studies
what possible advantages multi-pivot quicksort might offer in general. The
contributions are as follows: Natural comparison-optimal algorithms for
multi-pivot quicksort are devised and analyzed. The analysis shows that the
benefits of using multiple pivots with respect to the average comparison count
are marginal and these strategies are inferior to simpler strategies such as
the well-known median-of-k approach. A substantial part of the partitioning
cost is caused by rearranging elements. A rigorous analysis of an algorithm for
rearranging elements in the partitioning step is carried out, observing mainly
how often array cells are accessed during partitioning. The algorithm behaves
best if 3 to 5 pivots are used. Experiments show that this translates into good
cache behavior and is closest to predicting observed running times of
multi-pivot quicksort algorithms. Finally, it is studied how choosing pivots
from a sample affects sorting cost. The study is theoretical in the sense that
although the findings motivate design recommendations for multi-pivot quicksort
algorithms that lead to running time improvements over known algorithms in an
experimental setting, these improvements are small.
Comment: Submitted to a journal, v2: Fixed statement of Gibbs' inequality, v3:
Revised version, especially improving on the experiments in Section
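To illustrate what splitting the input with several pivots means, here is a minimal sketch that classifies elements into segments by binary search over the sorted pivots. It is not the in-place rearrangement algorithm analyzed in the paper; the function name `multi_pivot_split` and the sample values are assumptions made for illustration.

```python
import bisect


def multi_pivot_split(items, pivots):
    """Distribute `items` into len(pivots) + 1 segments.

    Segment i receives the elements that fall between pivot i-1 and
    pivot i; elements equal to a pivot go to the segment on its left.
    """
    pivots = sorted(pivots)
    segments = [[] for _ in range(len(pivots) + 1)]
    for x in items:
        segments[bisect.bisect_left(pivots, x)].append(x)
    return segments


# Two pivots yield three segments.
segments = multi_pivot_split([5, 1, 8, 3, 9, 2, 7], pivots=[3, 7])
```

A full multi-pivot quicksort would recurse on each segment; this sketch uses extra lists rather than rearranging the array in place, so it says nothing about the cell-access costs the paper studies.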
Simple and Fast BlockQuicksort using Lomuto's Partitioning Scheme
This paper presents simple variants of the BlockQuicksort algorithm described
by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using
Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to
partition the input. To achieve a robust sorting algorithm that works well on
many different input types, the paper introduces a novel two-pivot variant of
Lomuto's partitioning scheme. A surprisingly simple twist to the generic
two-pivot quicksort approach makes the algorithm robust. The paper provides an
analysis of the theoretical properties of the proposed algorithms and compares
them to their competitors. The analysis shows that Lomuto-based approaches
incur a higher average sorting cost than the Hoare-based approach of
BlockQuicksort. Moreover, the analysis is particularly useful to reason about
pivot choices that suit the two-pivot approach. An extensive experimental study
shows that, despite their worse theoretical behavior, the simpler variants
perform as well as the original version of BlockQuicksort.
Comment: Accepted at ALENEX 201
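For readers unfamiliar with Lomuto's partitioning scheme, the following is a minimal sketch of the textbook single-pivot version. It is not the block-based or two-pivot variant proposed in the paper, and taking the last element as the pivot is an assumption made for brevity.

```python
def lomuto_partition(a, lo, hi):
    """Partition a[lo..hi] around the pivot a[hi]; return the pivot's final index."""
    pivot = a[hi]
    i = lo  # invariant: a[lo..i-1] holds the elements smaller than the pivot
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]  # move the pivot between the two parts
    return i


def quicksort(a, lo=0, hi=None):
    """Sort the list `a` in place using Lomuto partitioning."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = lomuto_partition(a, lo, hi)
        quicksort(a, lo, p - 1)
        quicksort(a, p + 1, hi)


data = [5, 2, 9, 1, 5, 6]
quicksort(data)
```

The single left-to-right scan is what makes the scheme simpler than Hoare's crossing pointers, which scan from both ends.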
Reproducibility Companion Paper: Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN
National Research Foundation (NRF) Singapore under NRF Fellowship Programm
ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms
This paper describes ANN-Benchmarks, a tool for evaluating the performance of
in-memory approximate nearest neighbor algorithms. It provides a standard
interface for measuring the performance and quality achieved by nearest
neighbor algorithms on different standard data sets. It supports several
different ways of integrating k-NN algorithms, and its configuration system
automatically tests a range of parameter settings for each algorithm.
Algorithms are compared with respect to many different (approximate) quality
measures, and adding more is easy and fast; the included plotting front-ends
can visualise these as images, LaTeX plots, and websites with interactive
plots. ANN-Benchmarks aims to provide a constantly updated overview of the
current state of the art of k-NN algorithms. In the short term, this overview
allows users to choose the correct k-NN algorithm and parameters for their
similarity search task; in the longer term, algorithm designers will be able to
use this overview to test and refine automatic parameter tuning. The paper
gives an overview of the system, evaluates the results of the benchmark, and
points out directions for future work. Interestingly, very different approaches
to k-NN search yield comparable quality-performance trade-offs. The system is
available at http://ann-benchmarks.com.
Comment: Full version of the SISAP 2017 conference paper. v2: Updated the
abstract to avoid arXiv linking to the wrong UR
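As an illustration of what a quality measure for approximate results can look like, here is a minimal sketch of recall, a commonly used measure. The abstract does not say which measures the tool implements or how it defines them, so the function `recall_at_k` and the sample identifiers are assumptions.

```python
def recall_at_k(approx_ids, true_ids, k):
    """Fraction of the true k nearest neighbors found in the approximate result."""
    found = set(approx_ids[:k]) & set(true_ids[:k])
    return len(found) / k


# Hypothetical point identifiers: exact answer vs. an approximate answer.
true_ids = [7, 3, 9, 1, 4]
approx_ids = [7, 9, 2, 3, 8]
score = recall_at_k(approx_ids, true_ids, k=5)  # 3 of 5 true neighbors found
```

Plotting such a quality score against query throughput gives the kind of quality-performance trade-off the abstract mentions.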
Parameter-free Locality Sensitive Hashing for Spherical Range Reporting
We present a data structure for *spherical range reporting* on a point set
S, i.e., reporting all points in S that lie within radius r of a given
query point q. Our solution builds upon the Locality-Sensitive Hashing (LSH)
framework of Indyk and Motwani, which represents the asymptotically best
solutions to near neighbor problems in high dimensions. While traditional LSH
data structures have several parameters whose optimal values depend on the
distance distribution from q to the points of S, our data structure is
parameter-free, except for the space usage, which is configurable by the user.
Nevertheless, its expected query time basically matches that of an LSH data
structure whose parameters have been *optimally chosen for the data and query*
in question under the given space constraints. In particular, our data
structure provides a smooth trade-off between hard queries (typically addressed
by standard LSH) and easy queries such as those where the number of points to
report is a constant fraction of S, or where almost all points in S are far
away from the query point. In contrast, known data structures fix LSH
parameters based on certain parameters of the input alone.
The algorithm has expected query time bounded by O(t (n/t)^ρ), where
t is the number of points to report and ρ ∈ (0,1) depends on the data
distribution and the strength of the LSH family used. We further present a
parameter-free way of using multi-probing, for LSH families that support it,
and show that for many such families this approach allows us to get expected
query time close to O(n^ρ + t), which is the best we can hope to achieve
using LSH. The previously best running time in high dimensions was Ω(t n^ρ).
For many data distributions where the intrinsic dimensionality of the
point set close to q is low, we can give improved upper bounds on the
expected query time.
Comment: 21 pages, 5 figures, due to the limitation "The abstract field cannot
be longer than 1,920 characters", the abstract appearing here is slightly
shorter than that in the PDF fil